
Print Timestamp and Resource Usage#193

Open
cfsarmiento wants to merge 42 commits into main from print-timestamp-and-resource-usage

Conversation

@cfsarmiento
Collaborator

Summary

This PR adds reporting for resource utilization during DPP runs using Prometheus.

Issue

How was it tested?

Tested by running manually within a Podman container. I set up Prometheus in a second container on the same network to collect metrics. I do not have access to an OpenShift cluster at the moment, so I was not able to test within OpenShift, but the script was updated to work with or without Prometheus.

Output

Output With Prometheus

Compilation:

Screenshot 2026-02-24 at 10 10 02 AM

Inference:

Screenshot 2026-02-24 at 10 12 28 AM

Output Without Prometheus

Compilation:

Screenshot 2026-02-24 at 10 18 11 AM

Inference:

Screenshot 2026-02-24 at 10 21 02 AM

@thanh-lam

I'm able to clone this branch and will test to see the output. That looks good. One question:
@cfsarmiento - Do you print the outputs on rank 0 only? Would the numbers differ if each rank printed its own?

Also, as @matthew-pisano suggested, the user needs to install the pip package prometheus-api-client. By "conditional", I guess the script can install the package if it's not already there, right?

@cfsarmiento
Collaborator Author

> I'm able to clone this branch and will test to see the output. That looks good. One question: @cfsarmiento - Do you print the outputs on rank 0 only? Would the numbers differ if each rank printed its own?
>
> Also, as @matthew-pisano suggested, the user needs to install the pip package prometheus-api-client. By "conditional", I guess the script can install the package if it's not already there, right?

I included all the changes within the warmup function, so on the runs I have done there has been output for each rank. I made sure it gets blocked out so that you can see when it starts and ends for each rank. The dprint function handles the rank number, so I used that where appropriate.

As for the package, @matthew-pisano and I talked this morning. I only included that package so that I could use type annotations in the functions. I don't want that overhead if it isn't needed, so we landed on removing the import and type annotations from the main DPP script and adding a try/except in the resource_collection.py file that ignores the import if the package isn't installed. This assumes that if you do not have the package installed, you do not want to check resource utilization. That said, I can also do what you suggested and install the package if it isn't already present. Which path do you suggest I take here? I don't mind doing either.

@thanh-lam

It's probably better to not install the package. One concern is that the package may change in the future. With that in mind, which is better: installed or not installed?

I just cloned the PR and the script still has the import:

from prometheus_api_client import PrometheusConnect

@cfsarmiento
Collaborator Author

> It's probably better to not install the package. One concern is that the package may change in the future. With that in mind, which is better: installed or not installed?
>
> I just cloned the PR and the script still has the import:
>
> from prometheus_api_client import PrometheusConnect

Yeah, I haven't made any changes yet; I wanted to see where we stood on it first. I'll rework it and push to the branch for you to consume. I will make it so that people do not have to have the package installed for the script to run.

@cfsarmiento
Collaborator Author

@thanh-lam I pushed that change. It should work without the package: I wrapped the import in a try/except, and all the other code already checks that a client exists and gracefully handles the case where it doesn't. Let me know if you run into any issues.

@cfsarmiento force-pushed the print-timestamp-and-resource-usage branch from b2fcd8e to 40f0d87 on February 25, 2026 03:13
@thanh-lam

@cfsarmiento - I checked out this branch and tried to pip install, but ran into a problem with the torch version. You use the latest level of aiu-fms-testing-utils (0.7.1), which requires torch 2.10.0. We currently have torch 2.7.1.

pip error messages:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.22.1+cpu requires torch==2.7.1, but you have torch 2.10.0 which is incompatible.
torch-sendnn 1.1.1+0 requires torch<2.8.0,>=2.5.1, but you have torch 2.10.0 which is incompatible.

Can you back your PR off to aiu-fms-testing-utils 0.6.0? Or open another PR against 0.6.0? This is problematic and needs someone from aiu-fms-testing-utils to resolve it.

@thanh-lam

Or, if I just copy your drive_paged_programs.py into my local aiu-fms-testing-utils 0.6.0, would that work? Are there any other files I need?

@thanh-lam

If I use your script with AFTU, it can't find this module:

ModuleNotFoundError: No module named 'aiu_fms_testing_utils.utils.resource_collection'

@cfsarmiento
Collaborator Author

> If I use your script with AFTU, it can't find this module:
>
> ModuleNotFoundError: No module named 'aiu_fms_testing_utils.utils.resource_collection'

You'll need to copy over resource_collection.py as well, since drive_paged_programs.py uses functions from it. I will try to set my branch to 0.6.0 in a little bit so you can just pull that.

@thanh-lam

Copying drive_paged_programs.py and resource_collection.py into the 0.6.0 files dir doesn't seem to work. This is from resource_collection.py:

from aiu_fms_testing_utils.utils.aiu_setup import dprint
try:
    from prometheus_api_client import PrometheusConnect
except Exception:
    print("WARNING: Cannot import `prometheus_api_client`. Make sure the package is installed if you are trying to report resource utilization.")

@cfsarmiento
Collaborator Author

> Copying drive_paged_programs.py and resource_collection.py into the 0.6.0 files dir doesn't seem to work. This is from resource_collection.py:
>
> from aiu_fms_testing_utils.utils.aiu_setup import dprint
> try:
>     from prometheus_api_client import PrometheusConnect
> except Exception:
>     print("WARNING: Cannot import `prometheus_api_client`. Make sure the package is installed if you are trying to report resource utilization.")

The script should work regardless of whether that package is pip installed. If you need help, we can hop on a Slack huddle or something to debug.

)

# Instantiate the Prometheus client for resource metric collection
p = instantiate_prometheus()
Collaborator


If Prometheus is not installed here, it looks like p will be None. Further on in the script you use p in other functions. If p is None, does everything work properly, or are there NoneType errors?

Prometheus is referenced unconditionally. You may want to make this opt-in for the user with some sort of argument, then raise an error if the package is not installed. Right now, when the package is missing, the code simply prints a warning. It would be clearer if nothing referenced the module unless it was requested, with an explicit failure when it was.

Collaborator Author


I can change it to not print anything at all if the package is not installed. p being None is fine, since all the metric reporting happens only when it isn't None, and those functions are the only things that actually use p. I ran the script multiple times on my end to make sure it ran as expected whether or not the package is installed. The screenshots in the description show the expected output: if the package is installed you get resource metrics; otherwise it just tells you what stage the script is at.
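The None-guard described here can be sketched as below. This is a simplified illustration, not the PR's code: the real print_step takes different arguments, and `cpu_and_mem` is a hypothetical helper standing in for the actual Prometheus queries.

```python
from datetime import datetime, timezone

def print_step(client, stage: str, event: str) -> str:
    """Print a timestamped stage marker; append metrics only when a client exists.

    Sketch only: `client` mirrors the script's `p`, which is None when
    Prometheus is unavailable, and `cpu_and_mem` is a made-up helper.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d:%H:%M:%S")
    line = f"[{stamp}] {stage} {event}"
    if client is not None:  # p is None without Prometheus; skip metrics silently
        cpu_pct, mem_gb = client.cpu_and_mem()
        line += f" - CPU: {cpu_pct:.2f}%, Memory: {mem_gb:.2f} GB"
    print(line)
    return line
```

With this shape, every caller can pass p unconditionally and the metrics suffix simply disappears when no client was configured.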

Collaborator


Got it! My only suggestion, then, is to make the parameters for p optional in the type hints. That makes it clear to anyone using the functions that p is not required.
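Such an Optional signature might look like the following; `report_peak_usage` and `peak_cpu_mem` are hypothetical names, and the TYPE_CHECKING import keeps prometheus_api_client out of the runtime dependency set, matching the thread's goal.

```python
from __future__ import annotations  # lets annotations reference type-check-only names

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Imported for type checkers only; no runtime dependency on the package.
    from prometheus_api_client import PrometheusConnect

def report_peak_usage(p: Optional[PrometheusConnect]) -> Optional[str]:
    """Return a peak-utilization line, or None when no client is configured."""
    if p is None:
        return None
    cpu_pct, mem_gb = p.peak_cpu_mem()  # hypothetical helper for the real queries
    return f"Peak Resource Utilization - CPU: {cpu_pct:.2f}%, Memory: {mem_gb:.2f} GB"
```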

@thanh-lam

@cfsarmiento - Finally, I was able to install AFTU 0.7.0 and run with your scripts. But the test failed with a DtException:

[ 0/ 4]: PT compile complete, took 470.762s
[ 3/ 4]: PT compile complete, took 496.284s
[ 1/ 4]: PT compile complete, took 498.586s
[ 2/ 4]: PT compile complete, took 501.217s
[ 0/ 4]: extracted prompts in 57.7897 seconds
[ 0/ 4]: *** testing program ProgramCriteria(program_id=19) ***
[ 0/ 4]: program id: ProgramCriteria(program_id=19), valid prompt: (1, 128), input shape: torch.Size([1, 128])
[2026-02-27:16:10:53] Inference started
[2026-02-27:16:11:18] Inference started
[2026-02-27:16:11:21] Inference started
[2026-02-27:16:11:23] Inference started
W0227 16:12:58.404000 74984 torch/_dynamo/backends/common.py:47] [0/2] aot_autograd-based backend ignoring extra kwargs {'options': {'sendnn.dynamic': True}}
W0227 16:12:58.416000 74985 torch/_dynamo/backends/common.py:47] [0/2] aot_autograd-based backend ignoring extra kwargs {'options': {'sendnn.dynamic': True}}
W0227 16:12:58.418000 74983 torch/_dynamo/backends/common.py:47] [0/2] aot_autograd-based backend ignoring extra kwargs {'options': {'sendnn.dynamic': True}}
W0227 16:12:58.454000 74982 torch/_dynamo/backends/common.py:47] [0/2] aot_autograd-based backend ignoring extra kwargs {'options': {'sendnn.dynamic': True}}
[DeepRT] ======= DtException CAUGHT =====
compile_graph(DtException::DtException(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x98) [0x40d9b8]
/opt/ibm/spyre/deeptools/lib/libdeeprt.so(+0x29e6b) [0x7f7ed9239e6b]
/opt/ibm/spyre/runtime/lib/libdee_internal.so(dee::RunDeepRt(sengraph::Graph*)+0x82) [0x7f7ed94ab3d2]
/opt/ibm/spyre/runtime/lib/libdee_internal.so(dee::PBD::CompileGraph(sendnn::Graph*, sendnn::Graph const&, bool)+0xc02) [0x7f7ed94fae92]
/opt/ibm/spyre/runtime/lib/libdee_internal.so(sendnn::DTCompiler::CompileGraph(sendnn::Graph*, sendnn::Graph const&, bool)+0x841) [0x7f7ed951bab1]
compile_graph() [0x408c06]
/lib64/libc.so.6(+0x295d0) [0x7f7ed47625d0]
/lib64/libc.so.6(__libc_start_main+0x80) [0x7f7ed4762680]
compile_graph() [0x40a935]

DtException: Insufficient sengraphs passed for decoder compilation, expected >=2 but got 0, file /project_src/deeptools/deeprt/deeprt.cpp line 3303
[DeepRT] ======= dscGlobal.showOpt() =====

As shown in the log above, there were only the "Inference started" print-outs. I thought that before those there should be "Compilation" started and completed messages. Something is not working right. Here's my run command:

$ cat /ist-sandbox/IST/scripts/run-granite-ptandru.sh
#!/bin/bash

export DTCOMPILER_EXPORT_DIR=/tmp/
export DT_DEEPRT_VERBOSE=-1
export VLLM_DT_MAX_CONTEXT_LEN=8192
export VLLM_DT_MAX_BATCH_SIZE=8
export VLLM_DT_MAX_BATCH_TKV_LIMIT=131072

PYTHONPATH=/tmp/aiu_fms_testing_utils:$PYTHONPATH torchrun --nproc-per-node=4 /tmp/aiu_fms_testing_utils/scripts/drive_paged_programs.py --model_variant=/modelstore-shared/granite/4.0/8b --program_criteria_json_path=/tmp/dpp-print/test-criteria.json --dataset_type=sharegpt --skip_validation --programs "*:0,<8192" --prioritize_large_batch_sizes --enforce_homogeneous_prompt_programs --prefill_chunk_size=1024 --dataset_path=/modelstore-shared/datasets/granite/dpp/gpt.json

@thanh-lam

My env with torch and fms levels:

$ pip3 list | grep -e torch -e fms
aiu-fms-testing-utils     0.7.0
fms-model-optimizer       0.8.1
ibm-fms                   1.7.0       /shared/IST/foundation-model-stack
torch                     2.7.1+cpu
torch_sendnn              1.1.1+0
torchao                   0.11.0
torchvision               0.22.1+cpu

@thanh-lam

@cfsarmiento - Had you seen that DtException error while testing the script?

@cfsarmiento
Collaborator Author

cfsarmiento commented Feb 27, 2026

@thanh-lam Here are the environment variables I run with for DPP:

export VLLM_DT_CHUNK_LEN=1024 
export VLLM_DT_MAX_BATCH_TKV_LIMIT=16384 
export VLLM_DT_MAX_BATCH_SIZE=4 
export VLLM_DT_MAX_CONTEXT_LEN=1024

I remember that when we were initially setting up, setting these seemed to help with some errors. You can try them and let me know if you run into it again. I have other environment variables set, but those are more specific to the container configuration itself, so I imagine they wouldn't have an effect in OpenShift; I can provide them if it is still giving you an issue.

@thanh-lam

> @thanh-lam Here are the environment variables I run with for DPP:
>
> export VLLM_DT_CHUNK_LEN=1024
> export VLLM_DT_MAX_BATCH_TKV_LIMIT=16384
> export VLLM_DT_MAX_BATCH_SIZE=4
> export VLLM_DT_MAX_CONTEXT_LEN=1024
>
> I remember that when we were initially setting up, setting these seemed to help with some errors. You can try them and let me know if you run into it again.

Nope, setting those didn't help. It failed with the same errors.

@cfsarmiento
Collaborator Author

@thanh-lam These are all the environment variables I pass:

export FLEX_DEVICE=VF 
export TORCH_SENDNN_CACHE_DIR=/opt/ibm/spyre/models/cache/ 
export TORCH_SENDNN_TEMP_CACHE_DIR=/opt/ibm/spyre/models/cache/
export TORCH_SENDNN_CACHE_ENABLE=1
export VLLM_SPYRE_USE_CB=1
export VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_SPYRE_USE_CHUNKED_PREFILL=1
export VLLM_DT_CHUNK_LEN=1024
export VLLM_DT_MAX_BATCH_TKV_LIMIT=16384
export VLLM_DT_MAX_BATCH_SIZE=4
export VLLM_DT_MAX_CONTEXT_LEN=1024

Again, some of these are specific to the vLLM containers we use, but it is worth a shot. I will say, though, that my changes in this PR do seem to be picked up, because I see those output messages. When I was running, after I got my specific configuration set up, I didn't run into any DtException errors like that.

@thanh-lam

I took the two scripts, put them in the environment I usually run, and am trying my regular run now: one that worked before, now just with your scripts. It failed with the same errors. It could be the AFTU 0.7.0 level, which I've not run before, so I'll revert the scripts and try a plain run with AFTU 0.7.0.

One thing I spot in the log this time:

[ 0/ 2]: extracted prompts in 9.8234 seconds
[ 0/ 2]: You requested 15 prompts but we were only able to get 10 valid prompts. We will be repeating the first prompt.
[ 0/ 2]: *** testing program ProgramCriteria(program_id=0) ***
[ 0/ 2]: program id: ProgramCriteria(program_id=0), valid prompt: (15, 7104), input shape: torch.Size([15, 7104])
[2026-02-27:16:56:53] Inference started <<<<< "CPU inference started"
[ 0/ 2]: cpu validation info found for seed=0 -- loading it
[2026-02-27:16:56:59] CPU inference completed

That "Inference started" message we saw in other runs should be:

[2026-02-27:16:56:53] CPU Inference started

It makes sense because it is followed by the "CPU inference completed" message. This needs to be corrected in your script. Also, I thought you had "Compilation started". Does that also mean "CPU Compilation started"?

@cfsarmiento
Collaborator Author

@thanh-lam I worked with @jameslivulpi and we removed the commit that changed the torch version. That change now lives only in main, so this branch shouldn't have anything specific to it. That's good, because it was not a change I made to begin with; it was odd that it somehow got into my branch.

@thanh-lam

@cfsarmiento - Just cloned the branch:

% git clone -b print-timestamp-and-resource-usage https://github.com/foundation-model-stack/aiu-fms-testing-utils.git
Cloning into 'aiu-fms-testing-utils'...
remote: Enumerating objects: 5678, done.
remote: Counting objects: 100% (198/198), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 5678 (delta 163), reused 151 (delta 151), pack-reused 5480 (from 2)
Receiving objects: 100% (5678/5678), 4.49 MiB | 2.07 MiB/s, done.
Resolving deltas: 100% (3472/3472), done.

% ls -ltr
total 72
-rw-r--r--   1 thanhlam  staff  11357 Feb 27 14:13 LICENSE
-rw-r--r--   1 thanhlam  staff   8596 Feb 27 14:13 README.md
-rw-r--r--   1 thanhlam  staff    631 Feb 27 14:13 RELEASING.md
drwxr-xr-x   6 thanhlam  staff    192 Feb 27 14:13 aiu_fms_testing_utils
-rw-r--r--   1 thanhlam  staff    215 Feb 27 14:13 code-of-conduct.md
drwxr-xr-x   4 thanhlam  staff    128 Feb 27 14:13 examples
-rw-r--r--   1 thanhlam  staff   2622 Feb 27 14:13 pyproject.toml
drwxr-xr-x  11 thanhlam  staff    352 Feb 27 14:13 scripts
drwxr-xr-x   9 thanhlam  staff    288 Feb 27 14:13 tests

% git branch
* print-timestamp-and-resource-usage

% grep torch pyproject.toml
"torch==2.10.0",

@cfsarmiento
Collaborator Author

> @cfsarmiento - Just cloned the branch:
>
> % git clone -b print-timestamp-and-resource-usage https://github.com/foundation-model-stack/aiu-fms-testing-utils.git
> Cloning into 'aiu-fms-testing-utils'...
> remote: Enumerating objects: 5678, done.
> remote: Counting objects: 100% (198/198), done.
> remote: Compressing objects: 100% (47/47), done.
> remote: Total 5678 (delta 163), reused 151 (delta 151), pack-reused 5480 (from 2)
> Receiving objects: 100% (5678/5678), 4.49 MiB | 2.07 MiB/s, done.
> Resolving deltas: 100% (3472/3472), done.
>
> % ls -ltr
> total 72
> -rw-r--r--   1 thanhlam  staff  11357 Feb 27 14:13 LICENSE
> -rw-r--r--   1 thanhlam  staff   8596 Feb 27 14:13 README.md
> -rw-r--r--   1 thanhlam  staff    631 Feb 27 14:13 RELEASING.md
> drwxr-xr-x   6 thanhlam  staff    192 Feb 27 14:13 aiu_fms_testing_utils
> -rw-r--r--   1 thanhlam  staff    215 Feb 27 14:13 code-of-conduct.md
> drwxr-xr-x   4 thanhlam  staff    128 Feb 27 14:13 examples
> -rw-r--r--   1 thanhlam  staff   2622 Feb 27 14:13 pyproject.toml
> drwxr-xr-x  11 thanhlam  staff    352 Feb 27 14:13 scripts
> drwxr-xr-x   9 thanhlam  staff    288 Feb 27 14:13 tests
>
> % git branch
> * print-timestamp-and-resource-usage
>
> % grep torch pyproject.toml
> "torch==2.10.0",

That is correct, since that is what is in main now; my earlier commit was just to show that I was not the one who made that change. If you are running into issues with 2.10.0, I would bump the version down to 2.7.1 just so we can test that the resource utilization reporting works in OpenShift.

@thanh-lam

Started with a fresh new container and AFTU 0.6.0 then replaced the two scripts:

  • drive_paged_programs.py
  • resource_collection.py

Ran the same command and it failed with different error:

[ 1/ 4]: PT compile complete, took 128.051s
Traceback (most recent call last):
  File "/tmp/aiu_fms_testing_utils/scripts/drive_paged_programs.py", line 1526, in <module>
    main()
  File "/tmp/aiu_fms_testing_utils/scripts/drive_paged_programs.py", line 1477, in main
    valid_prompts = prepare_test_prompts(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/aiu_fms_testing_utils/scripts/drive_paged_programs.py", line 1196, in prepare_test_prompts
    program_criteria_json_list = json.load(f)["programs"]
                                 ^^^^^^^^^^^^
  File "/usr/lib64/python3.12/json/__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/json/decoder.py", line 356, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
W0227 19:48:42.504000 163 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 228 closing signal SIGTERM
Signal Received: 15 (Terminated)

This is drive_paged_programs.py now. The one you used to add your code was probably different from the one in AFTU 0.6.0. I'll try to get a container with torch 2.10 at this point. My thought is: you would need to add your code to the 0.6.0-level drive_paged_programs.py for it to work with 0.6.0 and torch 2.7.1.

@thanh-lam

With AFTU 0.7.1 and torch 2.10, I started to see your print-outs. This is a good sign:

 ERRR 27.02.2026 21:02:29.263748 [         software_counters.cpp: 119] Software counter key "MSI0" is already registered.
 ERRR 27.02.2026 21:02:29.263796 [         software_counters.cpp: 119] Software counter key "MSI1" is already registered.
[2026-02-27:21:04:04] Compilation completed
[ 2/ 4]: PT compile complete, took 123.922s
[2026-02-27:21:04:08] Compilation completed
[ 1/ 4]: PT compile complete, took 127.723s
[2026-02-27:21:04:08] Compilation completed
[ 3/ 4]: PT compile complete, took 128.147s
[2026-02-27:21:04:08] Compilation completed
[ 0/ 4]: PT compile complete, took 128.316s
[2026-02-27:21:04:53] AIU Inference started
[2026-02-27:21:04:56] AIU Inference started
[2026-02-27:21:04:57] AIU Inference started

Hoping to have more when this run completes.

@thanh-lam

thanh-lam commented Feb 27, 2026

Okay, the run completed with no issue this time. Looking into the log, there are "Compilation" started and completed times, but there were no "Peak resource usage" print-outs. Maybe you can look into it next week. @cfsarmiento

$ grep -A1 Compilation run-granite-ptandru-4aiu-4mbs-1Klen-aftu071-log.txt
[2026-02-27:21:02:00] Compilation started
[ 2/ 4]: AIU warmup
[2026-02-27:21:02:00] Compilation started
[ 1/ 4]: AIU warmup
[2026-02-27:21:02:00] Compilation started
[ 0/ 4]: AIU warmup
[2026-02-27:21:02:00] Compilation started
W0227 21:02:04.650000 2231 torch/_dynamo/backends/common.py:53] [0/0] aot_autograd-based backend ignoring extra kwargs {'options': {'sendnn.dynamic': True}}
--
[2026-02-27:21:04:04] Compilation completed
[ 2/ 4]: PT compile complete, took 123.922s
[2026-02-27:21:04:08] Compilation completed
[ 1/ 4]: PT compile complete, took 127.723s
[2026-02-27:21:04:08] Compilation completed
[ 3/ 4]: PT compile complete, took 128.147s
[2026-02-27:21:04:08] Compilation completed
[ 0/ 4]: PT compile complete, took 128.316s

There are also a bunch of "Inference" started and completed print-outs (more than expected):

[2026-02-27:21:06:17] AIU Inference started
[2026-02-27:21:06:19] AIU inference completed

Counting the messages:

$ grep -c "Inference started" run-granite-ptandru-4aiu-4mbs-1Klen-aftu071-log.txt
36

$ grep -c -i "inference completed" run-granite-ptandru-4aiu-4mbs-1Klen-aftu071-log.txt
35

It looked like that included both CPU and AIU inferences. Anyway, getting it to run is good progress. We can talk more next week.

@thanh-lam

BTW, it's important to note the levels that made it work this time:

$ pip3 list | grep -e fms -e torch
aiu-fms-testing-utils     0.7.1.dev37+g60013ae98 /shared/IST/aiu-fms-testing-utils.ptandru
fms-model-optimizer       0.8.1
ibm-fms                   1.7.0                  /shared/IST/foundation-model-stack
torch                     2.10.0+cpu
torch_sendnn              1.2.0+main.1.7a6dd9b.0
torchao                   0.11.0
torchaudio                2.10.0+cpu
torchvision               0.25.0+cpu

@thanh-lam

@cfsarmiento - The following errors are in the output logs. They may have to do with the missing "Peak Resource Utilization" print-outs:

ERRR 27.02.2026 21:40:09.528114 [         software_counters.cpp: 119] Software counter key "MSI0" is already registered.
ERRR 27.02.2026 21:40:09.528153 [         software_counters.cpp: 119] Software counter key "MSI1" is already registered.

@cfsarmiento
Collaborator Author

@thanh-lam Can you send me the full log on slack or something? If you are not seeing usage information, it is because of one of the following:

  1. You did not pip install prometheus_api_client
  2. Environment variable PROMETHEUS_URL is not set
  3. Environment variable PROMETHEUS_API_KEY is not set
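The three prerequisites above suggest a guard like the following. The function name matches the instantiate_prometheus seen in the review diff and the env var names come from this thread, but the body is a sketch, not the PR's implementation.

```python
import os

def instantiate_prometheus():
    """Return a PrometheusConnect client, or None when any prerequisite is missing.

    Sketch under the assumptions above: metrics are best-effort, so missing
    config or a missing package just disables collection instead of failing.
    """
    url = os.environ.get("PROMETHEUS_URL")
    api_key = os.environ.get("PROMETHEUS_API_KEY")
    if not url or not api_key:
        return None  # prerequisites 2/3 unmet: run without metrics
    try:
        from prometheus_api_client import PrometheusConnect
    except ImportError:
        return None  # prerequisite 1 unmet: package not pip-installed
    return PrometheusConnect(
        url=url,
        headers={"Authorization": f"Bearer {api_key}"},
        disable_ssl=True,  # matches the self-signed-cert warning seen on OpenShift
    )
```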

I think the reason you are seeing the messages multiple times is that the code executes those functions for each card (rank).

As for these errors:

ERRR 27.02.2026 21:02:29.263748 [         software_counters.cpp: 119] Software counter key "MSI0" is already registered.
ERRR 27.02.2026 21:02:29.263796 [         software_counters.cpp: 119] Software counter key "MSI1" is already registered.

I am not sure what would be causing this; I would need more information. I personally haven't seen this error in my runs, so it might be an OpenShift-specific thing.

@thanh-lam

thanh-lam commented Mar 3, 2026

@cfsarmiento - Here's a sample of the print-outs:


[2026-03-02:21:03:05] Compilation started - CPU: 1.15%, Memory: 9.39 GB
[2026-03-02:21:11:15] Compilation completed - CPU: 3.23%, Memory: 9.26 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 4.65%, Memory: 155.40 GB
[2026-03-02:21:11:48] CPU Inference started - CPU: 2.97%, Memory: 9.31 GB
[2026-03-02:21:11:49] CPU inference completed - CPU: 2.97%, Memory: 9.31 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 2.97%, Memory: 97.78 GB
[2026-03-02:21:11:49] AIU Inference started - CPU: 2.97%, Memory: 9.31 GB
[2026-03-02:21:13:31] AIU inference completed - CPU: 0.99%, Memory: 9.27 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 3.00%, Memory: 233.20 GB
[2026-03-02:21:13:33] CPU Inference started - CPU: 0.99%, Memory: 9.27 GB
[2026-03-02:21:13:34] CPU inference completed - CPU: 0.99%, Memory: 9.27 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 0.99%, Memory: 132.77 GB
[2026-03-02:21:13:35] AIU Inference started - CPU: 0.99%, Memory: 9.27 GB
[2026-03-02:21:14:32] AIU inference completed - CPU: 1.45%, Memory: 9.28 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 1.47%, Memory: 224.09 GB

That looks good from what I can see. I will verify the values against what we see on the Grafana dashboard. Note a couple of things:

  • This "inference" needs to be changed to "Inference" so both messages can be extracted consistently.
  • It looks like you print "Peak Resource Utilization" after "Compilation" or "Inference" completes, right? So it can be taken as the peak value during that "Compilation" or "Inference".

Signed-off-by: Christian Sarmiento <christian.sarmiento@ibm.com>
@cfsarmiento
Collaborator Author

@thanh-lam I added the change. As for your question for the peak resource utilization, that is correct. Peak resource utilization belongs to whatever stage precedes it.
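The exact peak query isn't shown in this thread. One common PromQL way to express "the peak during the preceding stage" is a max_over_time subquery spanning the stage's wall-clock window; the sketch below is illustrative (the helper name and 15-second resolution are made up, not taken from the PR).

```python
def peak_query(base_expr: str, start_ts: float, end_ts: float) -> str:
    """Wrap an instant PromQL expression in max_over_time over a stage's duration.

    Sketch only: [Ns:15s] is a PromQL subquery evaluating `base_expr`
    every 15 seconds across the last N seconds.
    """
    window_s = max(1, int(end_ts - start_ts))
    return f"max_over_time(({base_expr})[{window_s}s:15s])"
```

The resulting string would be handed to the client's custom_query at the moment the stage completes, yielding the peak rather than the final instant value.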

@thanh-lam

@cfsarmiento - As for the issue of installing prometheus-api-client, can you add a command-line parameter as an option? Something like:

--Resource_Utils

Then all your code would run under this condition. That also means users run DPP with --Resource_Utils only when they want to see these print-outs. You would also check for the prometheus-api-client installation and print an ERROR if it's not installed.

The alternative to a command-line option is to use an environment variable like:

RESOURCE_UTILS

set to False by default. Users can set it to True to enable the installation check and the print-outs in the logs.

@matthew-pisano self-requested a review March 5, 2026 14:55
Collaborator

@matthew-pisano left a comment


Looks good!

Signed-off-by: Christian Sarmiento <christian.sarmiento@ibm.com>
Signed-off-by: Christian Sarmiento <christian.sarmiento@ibm.com>
Signed-off-by: Christian Sarmiento <christian.sarmiento@ibm.com>
@cfsarmiento
Collaborator Author

@thanh-lam I have added the flag. Now, if you pass --report_resource_utilization, it will output resource utilization metrics (assuming you have PROMETHEUS_URL and PROMETHEUS_API_KEY set) and will also install the Prometheus Python package if it isn't already installed. If the flag isn't passed, all you will see is the staging for when compilation/inference starts/ends. I have tested it fully on Z, so it should work fine, but let me know if you run into any issues on OpenShift. Am I good to merge?
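An opt-in flag like the one described can be sketched with argparse as below. Note this sketch fails loudly when the package is absent (the earlier review suggestion), whereas the PR auto-installs it; only the flag name is taken from the PR.

```python
import argparse
import importlib.util

def build_parser() -> argparse.ArgumentParser:
    """Minimal sketch of the DPP driver's opt-in metrics flag."""
    parser = argparse.ArgumentParser(description="DPP driver (sketch)")
    parser.add_argument(
        "--report_resource_utilization",
        action="store_true",  # metrics are off unless explicitly requested
        help="collect and print Prometheus resource metrics",
    )
    return parser

def check_metrics_prereqs(args: argparse.Namespace) -> None:
    """Raise only when metrics were requested but the package is absent."""
    requested = args.report_resource_utilization
    if requested and importlib.util.find_spec("prometheus_api_client") is None:
        raise SystemExit(
            "ERROR: pip install prometheus-api-client to use --report_resource_utilization"
        )
```

With this shape, a run without the flag never touches prometheus_api_client at all, and a run with the flag fails with an explicit error instead of a warning.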

@thanh-lam

Notes: I found some discrepancy while comparing the DPP "Peak Resource Utilization" print-outs for memory:

[ 0/ 4]: Peak Resource Utilization - CPU: 13.00%, Memory: 503.63 GB
[2026-03-09:14:59:53] CPU Inference started - CPU: 12.78%, Memory: 8.82 GB
[2026-03-09:14:59:53] CPU Inference completed - CPU: 12.78%, Memory: 8.82 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 12.73%, Memory: 461.93 GB
[2026-03-09:14:59:53] AIU Inference started - CPU: 12.73%, Memory: 8.82 GB
[2026-03-09:15:00:07] AIU Inference completed - CPU: 15.07%, Memory: 8.82 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 13.91%, Memory: 498.74 GB
[2026-03-09:15:00:18] CPU Inference started - CPU: 13.79%, Memory: 8.86 GB
[2026-03-09:15:00:18] CPU Inference completed - CPU: 13.79%, Memory: 8.86 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 13.79%, Memory: 498.74 GB
[2026-03-09:15:00:19] AIU Inference started - CPU: 13.78%, Memory: 8.86 GB
[2026-03-09:15:00:31] AIU Inference completed - CPU: 13.01%, Memory: 8.86 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 13.03%, Memory: 498.74 GB
[2026-03-09:15:00:35] CPU Inference started - CPU: 12.59%, Memory: 8.86 GB

Compare with the values on the Grafana dashboard, as shown in the graph:
Screenshot 2026-03-09 at 3 46 28 PM

@thanh-lam

I had a discussion with @cfsarmiento, but we didn't see what caused the different values between the two methods of querying data from Prometheus. From above:

  • DPP: Peak Resource Utilization = 498 - 503 GB
  • Grafana: 82 - 95 GB

We know that the code in DPP does this:

mem_query = '(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024 / 1024 / 1024'

But, need to find out how Grafana does it. One observation from above "formula":

  • The amount of available memory may be the "culprit" here
  • Maybe some "other processes" were occupying parts of the memory and made "available memory" smaller. Hence, mem_query came out larger
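This hypothesis is consistent with the query being node-scoped: the expression above measures the whole node, while Grafana panels are often scoped to the pod (e.g. via container_memory_working_set_bytes), which would exclude other tenants' processes. A small sketch of issuing the query and parsing the instant-query result shape that prometheus_api_client returns; `used_gb_from_result` is a hypothetical helper, not PR code.

```python
# The node-level expression quoted above, reproduced for reference:
MEM_QUERY = (
    "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)"
    " / 1024 / 1024 / 1024"
)

def used_gb_from_result(result):
    """Extract the GB figure from a Prometheus instant-query result.

    `PrometheusConnect.custom_query` returns a list shaped like
    [{"metric": {...}, "value": [<unix_ts>, "<number-as-string>"]}].
    """
    if not result:
        return None  # empty result: query matched no series
    return float(result[0]["value"][1])

# With a connected client p, the call would be roughly:
#   used_gb = used_gb_from_result(p.custom_query(query=MEM_QUERY))
```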

@thanh-lam

I started a new test run with the same model and params as before. The DPP print-outs look closer to Grafana's values this time.

[ 0/ 4]: Peak Resource Utilization - CPU: 4.77%, Memory: 157.88 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 3.86%, Memory: 157.74 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 3.93%, Memory: 157.78 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 3.66%, Memory: 157.76 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 4.19%, Memory: 157.81 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 4.14%, Memory: 157.81 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 4.14%, Memory: 157.81 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 3.88%, Memory: 157.87 GB
[ 0/ 4]: Peak Resource Utilization - CPU: 3.37%, Memory: 157.84 GB

These values are also more stable, with little fluctuation. It could be that this is a newly created pod, while the old results were taken in a pod that had been running tests for many days.

@thanh-lam

@cfsarmiento - We talked about this before: If PROMETHEUS_API_KEY was not set up, the tests failed. But, you thought it should not fail? Maybe I was missing something?

[ 0/ 4]: AIU warmup
/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 'thanos-querier.openshift-monitoring.svc.cluster.local'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings
  warnings.warn(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/shared/IST/aiu-fms-testing-utils.ptandru/scripts//drive_paged_programs.py", line 1534, in <module>
[rank0]:     main()
[rank0]:   File "/shared/IST/aiu-fms-testing-utils.ptandru/scripts//drive_paged_programs.py", line 1467, in main
[rank0]:     warmup_model(
[rank0]:   File "/shared/IST/aiu-fms-testing-utils.ptandru/aiu_fms_testing_utils/utils/__init__.py", line 82, in warmup_model
[rank0]:     metric_start = print_step(profile, print_utilization, "started", "Compilation")
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/shared/IST/aiu-fms-testing-utils.ptandru/aiu_fms_testing_utils/utils/resource_collection.py", line 219, in print_step
[rank0]:     cpu_usage, mem_usage = get_static_read(p, recorded_time)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/shared/IST/aiu-fms-testing-utils.ptandru/aiu_fms_testing_utils/utils/resource_collection.py", line 116, in get_static_read
[rank0]:     cpu_response = client.custom_query(query=cpu_query, params={"time": recorded_time.timestamp()})
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/tmp/.local/lib/python3.12/site-packages/prometheus_api_client/prometheus_connect.py", line 475, in custom_query
[rank0]:     raise PrometheusApiClientException(
[rank0]: prometheus_api_client.exceptions.PrometheusApiClientException: HTTP Status Code 401 (b'Unauthorized\n')
# ---------------------------
# COMPLETE: RTN=1
# ---------------------------

It's okay if PROMETHEUS_API_KEY is required to be set up. We just need to document it.

@thanh-lam

After researching online and discussing with colleagues to understand all this, I think we can go back to basics and look into what's in /proc/meminfo, then compare the values in there with the Grafana or DPP results (prometheus-api-client). Note: they both get data from Prometheus data sources.

To understand the numbers in /proc/meminfo, see the Red Hat documentation.
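
As a starting point for that comparison, /proc/meminfo can be parsed directly and run through the same arithmetic as mem_query. A minimal sketch, using a made-up sample instead of a real node's file:

```python
# Hypothetical sketch: parse /proc/meminfo fields and apply the same
# arithmetic as mem_query.  `sample` is made-up data; on a real node,
# read the file with open("/proc/meminfo").

def parse_meminfo(text: str) -> dict:
    """Return /proc/meminfo fields in bytes (the file lists them in kB)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields[key.strip()] = int(rest.split()[0]) * 1024  # kB -> bytes
    return fields

sample = "MemTotal:       536870912 kB\nMemAvailable:   432013312 kB"
info = parse_meminfo(sample)
used_gib = (info["MemTotal"] - info["MemAvailable"]) / 1024**3
print(f"used: {used_gib:.2f} GiB")  # used: 100.00 GiB
```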

@cfsarmiento
Collaborator Author

@cfsarmiento - We talked about this before: If PROMETHEUS_API_KEY was not set up, the tests failed. But, you thought it should not fail? Maybe I was missing something?

It's okay if PROMETHEUS_API_KEY is required to be set up. We just need to document it.

Yes, I implemented it so that if either of those environment variables is not set, or if the package is not installed, it should still run anyway. Either way, it seems like you are running into errors regardless, so I will be sure to add more verbose error handling. If you are running on Z with the --report_resource_utilization flag set, you only need the PROMETHEUS_URL environment variable. On OpenShift, you need both PROMETHEUS_URL and PROMETHEUS_API_KEY.
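
The graceful-degradation behavior described here might look roughly like the following. This is a hypothetical sketch; `get_prometheus_client` is an illustrative name, not the actual helper in resource_collection.py:

```python
# Hypothetical sketch of graceful degradation: if the env vars or the pip
# package are missing, return None so callers skip resource reporting
# instead of crashing.
import os

def get_prometheus_client():
    """Return a connected client, or None so callers can skip reporting."""
    url = os.environ.get("PROMETHEUS_URL")
    if url is None:
        return None  # resource reporting silently disabled
    try:
        from prometheus_api_client import PrometheusConnect
    except ImportError:
        return None  # prometheus-api-client not installed; run without metrics
    headers = {}
    api_key = os.environ.get("PROMETHEUS_API_KEY")
    if api_key:  # required on OpenShift, optional on Z
        headers["Authorization"] = f"Bearer {api_key}"
    return PrometheusConnect(url=url, headers=headers, disable_ssl=True)

client = get_prometheus_client()
print("metrics enabled" if client else "metrics disabled")
```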

@thanh-lam

@cfsarmiento - Thanks!
This would be a good documentation point:

If you are running on Z and have the --report_resource_utilization flag set, you only need the PROMETHEUS_URL environment variable. Otherwise, if you are on OpenShift, you need both PROMETHEUS_URL and PROMETHEUS_API_KEY.

Maybe Z is a secured environment with other security procedures, so token authentication is not significant there. (Just my guess -- don't document this point.)
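
For the documentation, the setup could be sketched roughly as below. The variable names come from this PR; the URL and the `oc whoami -t` token retrieval are assumptions about a typical OpenShift setup, not values from this repo:

```shell
# Hypothetical documentation sketch; URL and token command are typical
# examples, not taken from this repo.

# On Z (per the discussion, no token auth needed): endpoint only.
export PROMETHEUS_URL="http://localhost:9090"

# On OpenShift: in-cluster Thanos querier plus a bearer token.
export PROMETHEUS_URL="https://thanos-querier.openshift-monitoring.svc.cluster.local:9091"
export PROMETHEUS_API_KEY="$(oc whoami -t)"  # assumes a logged-in oc session
```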

Signed-off-by: Christian Sarmiento <christian.sarmiento@ibm.com>
Signed-off-by: Christian Sarmiento <christian.sarmiento@ibm.com>
@cfsarmiento
Collaborator Author

@cfsarmiento - We talked about this before: If PROMETHEUS_API_KEY was not set up, the tests failed. But, you thought it should not fail? Maybe I was missing something?

It's okay if PROMETHEUS_API_KEY is required to be set up. We just need to document it.

Hey @thanh-lam, I have added some error handling and this issue should now be fixed. Try running this test case again and see if you run into the same issue (you shouldn't, but that's why we test :) ). Let me know if it is good so I can merge this in. I also added updated instructions for running with resource reporting; let me know if I added them to the correct README and whether they are clear enough. Thank you!

Signed-off-by: Christian Sarmiento <christian.sarmiento@ibm.com>